Begin by loading the dataset and generating a comprehensive summary of the numerical features, including the count, mean, standard deviation, minimum, maximum, and quartiles. This provides an initial understanding of each feature's central tendency and spread. Also review the data types of each column to confirm they are appropriate for their content.
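The cells below assume a DataFrame `df` is already in scope. A minimal loading sketch follows; the file name is hypothetical, and an inline two-row sample stands in for the real file so the sketch runs on its own:

```python
import io
import pandas as pd

# Hypothetical: in the real notebook this would be something like
#   df = pd.read_csv("pumpkin_seeds.csv")
# The inline sample below is a stand-in, not actual dataset rows.
sample_csv = io.StringIO(
    "Area,Perimeter,Class\n"
    "47939,868.485,Çerçevelik\n"
    "136574,1559.45,Ürgüp Sivrisi\n"
)
df = pd.read_csv(sample_csv)
print(df.shape)   # (rows, columns) of the sample
print(df.dtypes)  # Area int64, Perimeter float64, Class object
```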
import pandas as pd
import numpy as np
# Display summary statistics for numerical columns
print("Summary Statistics for Numerical Features:")
print(df.describe())
# Display data types of all columns
print("\nData Types of Each Column:")
print(df.dtypes)
Summary Statistics for Numerical Features:
Area Perimeter Major_Axis_Length Minor_Axis_Length \
count 2500.000000 2500.000000 2500.000000 2500.000000
mean 80658.220800 1130.279015 456.601840 225.794921
std 13664.510228 109.256418 56.235704 23.297245
min 47939.000000 868.485000 320.844600 152.171800
25% 70765.000000 1048.829750 414.957850 211.245925
50% 79076.000000 1123.672000 449.496600 224.703100
75% 89757.500000 1203.340500 492.737650 240.672875
max 136574.000000 1559.450000 661.911300 305.818000
Convex_Area Equiv_Diameter Eccentricity Solidity Extent \
count 2500.000000 2500.000000 2500.000000 2500.000000 2500.000000
mean 81508.084400 319.334230 0.860879 0.989492 0.693205
std 13764.092788 26.891920 0.045167 0.003494 0.060914
min 48366.000000 247.058400 0.492100 0.918600 0.468000
25% 71512.000000 300.167975 0.831700 0.988300 0.658900
50% 79872.000000 317.305350 0.863700 0.990300 0.713050
75% 90797.750000 338.057375 0.897025 0.991500 0.740225
max 138384.000000 417.002900 0.948100 0.994400 0.829600
Roundness Aspect_Ration Compactness
count 2500.000000 2500.000000 2500.000000
mean 0.791533 2.041702 0.704121
std 0.055924 0.315997 0.053067
min 0.554600 1.148700 0.560800
25% 0.751900 1.801050 0.663475
50% 0.797750 1.984200 0.707700
75% 0.834325 2.262075 0.743500
max 0.939600 3.144400 0.904900
Data Types of Each Column:
Area int64
Perimeter float64
Major_Axis_Length float64
Minor_Axis_Length float64
Convex_Area int64
Equiv_Diameter float64
Eccentricity float64
Solidity float64
Extent float64
Roundness float64
Aspect_Ration float64
Compactness float64
Class object
dtype: object
Examine the dataset for any missing or null values across all columns and assess the data types of each feature to ensure they are correctly assigned (e.g., numerical features as floats or integers, categorical features as objects). This step helps identify data cleaning needs and confirms data integrity.
import pandas as pd
import numpy as np
# Check for missing/null values in each column
missing_values = df.isnull().sum()
print("Missing or Null Values per Column:")
print(missing_values)
print("\nData Types of Each Column:")
print(df.dtypes)
Missing or Null Values per Column:
Area 0
Perimeter 0
Major_Axis_Length 0
Minor_Axis_Length 0
Convex_Area 0
Equiv_Diameter 0
Eccentricity 0
Solidity 0
Extent 0
Roundness 0
Aspect_Ration 0
Compactness 0
Class 0
dtype: int64
Data Types of Each Column:
Area int64
Perimeter float64
Major_Axis_Length float64
Minor_Axis_Length float64
Convex_Area int64
Equiv_Diameter float64
Eccentricity float64
Solidity float64
Extent float64
Roundness float64
Aspect_Ration float64
Compactness float64
Class object
dtype: object
In addition to generating summary statistics for each numerical feature, create visualizations such as histograms or density plots for each feature to analyze their distribution. This will help in understanding the skewness, modality, and range of the data, which is crucial for deciding if transformations or scaling are necessary later in the pipeline. Ensure that these plots are clearly labeled and interpreted to provide insights into the data distribution.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Select numerical columns only (exclude 'Class' which is object)
numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()
print("Summary Statistics for Numerical Features:")
print(df[numerical_cols].describe())
# Plot distribution for each numerical feature
for col in numerical_cols:
    plt.figure(figsize=(8, 5))
    sns.histplot(df[col], kde=True, bins=30, color='skyblue')
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')
    plt.grid(True, linestyle='--', alpha=0.6)
    plt.show()
    # Interpretation printout for each feature; modality is judged visually
    # from the KDE curve, so only skewness is quantified here
    skewness = df[col].skew()
    print(f"\nFeature: {col}")
    print(f" - Skewness: {skewness:.3f}")
    if skewness > 1:
        print(" - Interpretation: Highly right-skewed distribution, consider transformation.")
    elif skewness < -1:
        print(" - Interpretation: Highly left-skewed distribution, consider transformation.")
    else:
        print(" - Interpretation: Approximately symmetric distribution.")
Summary Statistics for Numerical Features:
Area Perimeter Major_Axis_Length Minor_Axis_Length \
count 2500.000000 2500.000000 2500.000000 2500.000000
mean 80658.220800 1130.279015 456.601840 225.794921
std 13664.510228 109.256418 56.235704 23.297245
min 47939.000000 868.485000 320.844600 152.171800
25% 70765.000000 1048.829750 414.957850 211.245925
50% 79076.000000 1123.672000 449.496600 224.703100
75% 89757.500000 1203.340500 492.737650 240.672875
max 136574.000000 1559.450000 661.911300 305.818000
Convex_Area Equiv_Diameter Eccentricity Solidity Extent \
count 2500.000000 2500.000000 2500.000000 2500.000000 2500.000000
mean 81508.084400 319.334230 0.860879 0.989492 0.693205
std 13764.092788 26.891920 0.045167 0.003494 0.060914
min 48366.000000 247.058400 0.492100 0.918600 0.468000
25% 71512.000000 300.167975 0.831700 0.988300 0.658900
50% 79872.000000 317.305350 0.863700 0.990300 0.713050
75% 90797.750000 338.057375 0.897025 0.991500 0.740225
max 138384.000000 417.002900 0.948100 0.994400 0.829600
Roundness Aspect_Ration Compactness
count 2500.000000 2500.000000 2500.000000
mean 0.791533 2.041702 0.704121
std 0.055924 0.315997 0.053067
min 0.554600 1.148700 0.560800
25% 0.751900 1.801050 0.663475
50% 0.797750 1.984200 0.707700
75% 0.834325 2.262075 0.743500
max 0.939600 3.144400 0.904900
Feature: Area
- Skewness: 0.496
- Interpretation: Approximately symmetric distribution.
Feature: Perimeter
- Skewness: 0.415
- Interpretation: Approximately symmetric distribution.
Feature: Major_Axis_Length
- Skewness: 0.503
- Interpretation: Approximately symmetric distribution.
Feature: Minor_Axis_Length
- Skewness: 0.104
- Interpretation: Approximately symmetric distribution.
Feature: Convex_Area
- Skewness: 0.494
- Interpretation: Approximately symmetric distribution.
Feature: Equiv_Diameter
- Skewness: 0.272
- Interpretation: Approximately symmetric distribution.
Feature: Eccentricity
- Skewness: -0.749
- Interpretation: Approximately symmetric distribution.
Feature: Solidity
- Skewness: -5.691
- Interpretation: Highly left-skewed distribution, consider transformation.
Feature: Extent
- Skewness: -1.027
- Interpretation: Highly left-skewed distribution, consider transformation.
Feature: Roundness
- Skewness: -0.373
- Interpretation: Approximately symmetric distribution.
Feature: Aspect_Ration
- Skewness: 0.548
- Interpretation: Approximately symmetric distribution.
Feature: Compactness
- Skewness: -0.062
- Interpretation: Approximately symmetric distribution.
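For the highly left-skewed Solidity and Extent, one common remedy is to reflect the values so the long tail points right and then compress it with a log transform. A minimal numpy sketch on synthetic left-skewed data (the generated data and the `skewness` helper are illustrative, not taken from the dataset):

```python
import numpy as np

def skewness(x):
    # Fisher-Pearson moment coefficient of skewness (matches pandas'
    # .skew() up to a small-sample correction factor)
    x = np.asarray(x, dtype=float)
    return ((x - x.mean()) ** 3).mean() / x.std() ** 3

rng = np.random.default_rng(0)
# Synthetic stand-in for a left-skewed feature (long tail toward low values)
feature = 10.0 - rng.gamma(shape=2.0, scale=1.0, size=2500)
skew_before = skewness(feature)

# Reflect about the maximum, then compress the tail with log1p
transformed = np.log1p(feature.max() - feature)
skew_after = skewness(transformed)

print(f"skewness before: {skew_before:.3f}")
print(f"skewness after:  {skew_after:.3f}")
```

Note the transform reverses the ordering of the values, which is harmless for tree models but should be kept in mind when interpreting coefficients.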
Calculate the correlation matrix for all numerical features and create a heatmap visualization of it using a suitable plotting library (e.g., seaborn or matplotlib). The heatmap should clearly highlight strong positive and negative correlations, helping to identify feature interdependencies and potential multicollinearity issues.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Select numerical columns only
numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()
# Calculate correlation matrix
corr_matrix = df[numerical_cols].corr()
print("Correlation Matrix:")
print(corr_matrix)
# Plot heatmap of the correlation matrix
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap='coolwarm', center=0,
cbar_kws={"shrink": .8}, square=True, linewidths=0.5)
plt.title('Correlation Heatmap of Numerical Features')
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()
Correlation Matrix:
Area Perimeter Major_Axis_Length Minor_Axis_Length \
Area 1.000000 0.928548 0.789133 0.685304
Perimeter 0.928548 1.000000 0.946181 0.392913
Major_Axis_Length 0.789133 0.946181 1.000000 0.099376
Minor_Axis_Length 0.685304 0.392913 0.099376 1.000000
Convex_Area 0.999806 0.929971 0.789061 0.685634
Equiv_Diameter 0.998464 0.928055 0.787078 0.690020
Eccentricity 0.159624 0.464601 0.704287 -0.590877
Solidity 0.158388 0.065340 0.119291 0.090915
Extent -0.014018 -0.140600 -0.214990 0.233576
Roundness -0.149378 -0.500968 -0.684972 0.558566
Aspect_Ration 0.159960 0.487880 0.729156 -0.598475
Compactness -0.160438 -0.484440 -0.726958 0.603441
Convex_Area Equiv_Diameter Eccentricity Solidity \
Area 0.999806 0.998464 0.159624 0.158388
Perimeter 0.929971 0.928055 0.464601 0.065340
Major_Axis_Length 0.789061 0.787078 0.704287 0.119291
Minor_Axis_Length 0.685634 0.690020 -0.590877 0.090915
Convex_Area 1.000000 0.998289 0.159156 0.139178
Equiv_Diameter 0.998289 1.000000 0.156246 0.159454
Eccentricity 0.159156 0.156246 1.000000 0.043991
Solidity 0.139178 0.159454 0.043991 1.000000
Extent -0.015449 -0.010970 -0.327316 0.067537
Roundness -0.153615 -0.145313 -0.890651 0.200836
Aspect_Ration 0.159822 0.155762 0.950225 0.026410
Compactness -0.160432 -0.156411 -0.981689 -0.019967
Extent Roundness Aspect_Ration Compactness
Area -0.014018 -0.149378 0.159960 -0.160438
Perimeter -0.140600 -0.500968 0.487880 -0.484440
Major_Axis_Length -0.214990 -0.684972 0.729156 -0.726958
Minor_Axis_Length 0.233576 0.558566 -0.598475 0.603441
Convex_Area -0.015449 -0.153615 0.159822 -0.160432
Equiv_Diameter -0.010970 -0.145313 0.155762 -0.156411
Eccentricity -0.327316 -0.890651 0.950225 -0.981689
Solidity 0.067537 0.200836 0.026410 -0.019967
Extent 1.000000 0.352338 -0.329933 0.336984
Roundness 0.352338 1.000000 -0.935233 0.933308
Aspect_Ration -0.329933 -0.935233 1.000000 -0.990778
Compactness 0.336984 0.933308 -0.990778 1.000000
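The near-perfect correlations above (e.g., Area vs Convex_Area at 0.9998, Compactness vs Aspect_Ration at -0.9908) point to severe multicollinearity. One standard way to quantify it is the variance inflation factor, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on all the others. A self-contained numpy sketch on synthetic data (the `vif` helper and the generated columns are illustrative):

```python
import numpy as np

def vif(X):
    # Variance inflation factor for each column of X; columns are
    # standardized first so no intercept term is needed
    X = np.asarray(X, dtype=float)
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    vifs = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Least-squares regression of column j on the remaining columns
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1.0 - resid.var() / y.var()
        vifs.append(1.0 / (1.0 - r2))
    return vifs

rng = np.random.default_rng(1)
a = rng.normal(size=2500)
b = rng.normal(size=2500)
# Third column is nearly a copy of the first, mimicking Area vs Convex_Area
c = a + 0.01 * rng.normal(size=2500)
vifs = vif(np.column_stack([a, b, c]))
for name, v in zip(["a", "b", "c"], vifs):
    print(f"VIF({name}) = {v:.1f}")
```

VIFs above ~10 are a common rule of thumb for problematic collinearity; the near-duplicate columns blow well past it while the independent column stays near 1.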
Evaluate the distribution of the target variable 'Class' by counting the instances of each class label. This will help determine if the dataset is balanced or if class imbalance techniques might be required.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Count instances of each class label
class_counts = df['Class'].value_counts()
class_proportions = df['Class'].value_counts(normalize=True)
print("Class Counts:")
print(class_counts)
print("\nClass Proportions:")
print(class_proportions)
# Plot class distribution
plt.figure(figsize=(8, 5))
sns.barplot(x=class_counts.index, y=class_counts.values, palette='viridis')
plt.title('Distribution of Target Variable: Class')
plt.xlabel('Class')
plt.ylabel('Count')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
Class Counts:
Class
Çerçevelik 1300
Ürgüp Sivrisi 1200
Name: count, dtype: int64
Class Proportions:
Class
Çerçevelik 0.52
Ürgüp Sivrisi 0.48
Name: proportion, dtype: float64
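With a 52/48 split the classes are close to balanced, so resampling techniques are likely unnecessary. A small sketch of the ratio check (the 1.5 threshold is a common rule of thumb, not from the source):

```python
from collections import Counter

def imbalance_ratio(labels):
    # Ratio of majority-class count to minority-class count
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Counts observed above: 1300 Çerçevelik vs 1200 Ürgüp Sivrisi
labels = ["Çerçevelik"] * 1300 + ["Ürgüp Sivrisi"] * 1200
ratio = imbalance_ratio(labels)
print(f"imbalance ratio: {ratio:.3f}")
if ratio < 1.5:  # rule-of-thumb threshold for "roughly balanced"
    print("Dataset is roughly balanced; no resampling needed.")
```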
Identify potential outliers in numerical features using statistical methods such as the interquartile range (IQR) or visualization techniques like boxplots. This step is crucial for understanding data variability and deciding on outlier treatment strategies.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Select numerical columns only
numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()
print("Outlier Detection using IQR method:")
for col in numerical_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
    num_outliers = outliers.shape[0]
    print(f"\nFeature: {col}")
    print(f" - Q1: {Q1:.3f}, Q3: {Q3:.3f}, IQR: {IQR:.3f}")
    print(f" - Lower Bound: {lower_bound:.3f}, Upper Bound: {upper_bound:.3f}")
    print(f" - Number of potential outliers: {num_outliers}")
    # Boxplot to visualize outliers
    plt.figure(figsize=(8, 5))
    sns.boxplot(x=df[col], color='lightcoral')
    plt.title(f'Boxplot of {col} (Outlier Detection)')
    plt.xlabel(col)
    plt.grid(True, linestyle='--', alpha=0.6)
    plt.show()
Outlier Detection using IQR method:
Feature: Area
- Q1: 70765.000, Q3: 89757.500, IQR: 18992.500
- Lower Bound: 42276.250, Upper Bound: 118246.250
- Number of potential outliers: 18
Feature: Perimeter
- Q1: 1048.830, Q3: 1203.341, IQR: 154.511
- Lower Bound: 817.064, Upper Bound: 1435.107
- Number of potential outliers: 16
Feature: Major_Axis_Length
- Q1: 414.958, Q3: 492.738, IQR: 77.780
- Lower Bound: 298.288, Upper Bound: 609.407
- Number of potential outliers: 21
Feature: Minor_Axis_Length
- Q1: 211.246, Q3: 240.673, IQR: 29.427
- Lower Bound: 167.106, Upper Bound: 284.813
- Number of potential outliers: 30
Feature: Convex_Area
- Q1: 71512.000, Q3: 90797.750, IQR: 19285.750
- Lower Bound: 42583.375, Upper Bound: 119726.375
- Number of potential outliers: 17
Feature: Equiv_Diameter
- Q1: 300.168, Q3: 338.057, IQR: 37.889
- Lower Bound: 243.334, Upper Bound: 394.891
- Number of potential outliers: 13
Feature: Eccentricity
- Q1: 0.832, Q3: 0.897, IQR: 0.065
- Lower Bound: 0.734, Upper Bound: 0.995
- Number of potential outliers: 18
Feature: Solidity
- Q1: 0.988, Q3: 0.992, IQR: 0.003
- Lower Bound: 0.983, Upper Bound: 0.996
- Number of potential outliers: 103
Feature: Extent
- Q1: 0.659, Q3: 0.740, IQR: 0.081
- Lower Bound: 0.537, Upper Bound: 0.862
- Number of potential outliers: 46
Feature: Roundness
- Q1: 0.752, Q3: 0.834, IQR: 0.082
- Lower Bound: 0.628, Upper Bound: 0.958
- Number of potential outliers: 5
Feature: Aspect_Ration
- Q1: 1.801, Q3: 2.262, IQR: 0.461
- Lower Bound: 1.110, Upper Bound: 2.954
- Number of potential outliers: 11
Feature: Compactness
- Q1: 0.663, Q3: 0.744, IQR: 0.080
- Lower Bound: 0.543, Upper Bound: 0.864
- Number of potential outliers: 2
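If treatment is warranted, one mild option is to clip (winsorize) values to the IQR fences rather than drop rows; whether to treat these points at all is a judgment call, since many are likely genuine seeds. A numpy sketch on synthetic data (the `iqr_clip` helper and the generated values are illustrative):

```python
import numpy as np

def iqr_clip(x, k=1.5):
    # Clip values to [Q1 - k*IQR, Q3 + k*IQR], the same fences used above
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return np.clip(x, lower, upper), lower, upper

rng = np.random.default_rng(2)
# Synthetic feature with a handful of extreme points injected
values = np.concatenate([
    rng.normal(100, 10, 2490),
    [300, 320, -50, 400, 500, -80, 350, 310, 290, -60],
])
clipped, lower, upper = iqr_clip(values)
print(f"fences: [{lower:.1f}, {upper:.1f}]")
n_outside = int(((values < lower) | (values > upper)).sum())
print(f"values outside fences before clipping: {n_outside}")
```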
Visualize relationships between pairs of strongly correlated features (|correlation| > 0.7) using scatter plots or pair plots. Specifically, create a pair plot for 'Area', 'Perimeter', 'Major_Axis_Length', 'Convex_Area', and 'Equiv_Diameter', using 'Class' as the hue to observe class-wise distributions. Display or save the visualizations to support understanding of the data structure and potential feature engineering opportunities.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Select numerical columns only
numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()
# Calculate correlation matrix
corr_matrix = df[numerical_cols].corr()
# Identify strongly correlated features with absolute correlation > 0.7 (excluding self-correlation)
strong_corr_pairs = []
for col1 in numerical_cols:
    for col2 in numerical_cols:
        if col1 != col2:
            corr_val = corr_matrix.loc[col1, col2]
            if abs(corr_val) > 0.7:
                pair = tuple(sorted([col1, col2]))
                if pair not in strong_corr_pairs:
                    strong_corr_pairs.append(pair)
# Extract unique features involved in strong correlations
strong_corr_features = sorted(set([feat for pair in strong_corr_pairs for feat in pair]))
# Ensure the specified features are included (as per instruction)
specified_features = ['Area', 'Perimeter', 'Major_Axis_Length', 'Convex_Area', 'Equiv_Diameter']
for feat in specified_features:
    if feat not in strong_corr_features:
        strong_corr_features.append(feat)
# Filter features to those present in df columns (just in case)
strong_corr_features = [f for f in strong_corr_features if f in df.columns]
print("Strongly correlated features (|corr| > 0.7):")
print(strong_corr_features)
# Create pair plot for these features with 'Class' as hue
# Note: sns.pairplot creates its own figure, so no plt.figure() call is needed
pairplot = sns.pairplot(df, vars=strong_corr_features, hue='Class', palette='Set2', diag_kind='kde', plot_kws={'alpha': 0.6, 's': 30})
pairplot.fig.suptitle('Pair Plot of Strongly Correlated Features with Class Hue', y=1.02)
# Save the pairplot figure
pairplot.savefig("../output/tables/pairplot_strong_corr_features.png")
print("Pair plot saved to '../output/tables/pairplot_strong_corr_features.png'")
Strongly correlated features (|corr| > 0.7):
['Area', 'Aspect_Ration', 'Compactness', 'Convex_Area', 'Eccentricity', 'Equiv_Diameter', 'Major_Axis_Length', 'Perimeter', 'Roundness']
Pair plot saved to '../output/tables/pairplot_strong_corr_features.png'
Perform data quality checks such as verifying unique values for categorical features, checking for duplicate rows, and ensuring consistency in feature scales and units. This step ensures the dataset is clean and reliable for subsequent modeling tasks.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# 1. Check unique values for categorical features
categorical_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
print("Unique values per categorical feature:")
for col in categorical_cols:
    unique_vals = df[col].unique()
    num_unique = df[col].nunique()
    print(f" - {col}: {num_unique} unique values -> {unique_vals}")
# 2. Check for duplicate rows
num_duplicates = df.duplicated().sum()
print(f"\nNumber of duplicate rows in dataset: {num_duplicates}")
# 3. Check consistency in feature scales and units
# - Summary statistics (already done previously, but we print min/max again for clarity)
numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()
print("\nNumerical features scale check (min and max values):")
for col in numerical_cols:
    min_val = df[col].min()
    max_val = df[col].max()
    print(f" - {col}: min = {min_val}, max = {max_val}")
# 4. Visual check for scale consistency using boxplots for numerical features
for col in numerical_cols:
    plt.figure(figsize=(8, 4))
    sns.boxplot(x=df[col], color='lightblue')
    plt.title(f'Boxplot for {col} to check scale and outliers')
    plt.xlabel(col)
    plt.grid(True, linestyle='--', alpha=0.6)
    plt.show()
# 5. Check for inconsistent units or suspicious values by comparing related features
# For example, check if Area and Convex_Area are consistent (Convex_Area should be >= Area)
inconsistent_area = df[df['Convex_Area'] < df['Area']]
print(f"\nNumber of rows where Convex_Area < Area (possible inconsistency): {inconsistent_area.shape[0]}")
# Check if Aspect_Ration (Aspect_Ratio) values are reasonable (positive and within expected range)
if 'Aspect_Ration' in df.columns:
    invalid_aspect_ratio = df[(df['Aspect_Ration'] <= 0) | (df['Aspect_Ration'] > 10)]
    print(f"Number of rows with invalid Aspect_Ration values (<=0 or >10): {invalid_aspect_ratio.shape[0]}")
# Check for any missing values in the dataset
missing_values = df.isnull().sum()
print("\nMissing values per column:")
print(missing_values[missing_values > 0] if missing_values.any() else "No missing values detected.")
Unique values per categorical feature:
- Class: 2 unique values -> ['Çerçevelik' 'Ürgüp Sivrisi']
Number of duplicate rows in dataset: 0
Numerical features scale check (min and max values):
- Area: min = 47939, max = 136574
- Perimeter: min = 868.485, max = 1559.45
- Major_Axis_Length: min = 320.8446, max = 661.9113
- Minor_Axis_Length: min = 152.1718, max = 305.818
- Convex_Area: min = 48366, max = 138384
- Equiv_Diameter: min = 247.0584, max = 417.0029
- Eccentricity: min = 0.4921, max = 0.9481
- Solidity: min = 0.9186, max = 0.9944
- Extent: min = 0.468, max = 0.8296
- Roundness: min = 0.5546, max = 0.9396
- Aspect_Ration: min = 1.1487, max = 3.1444
- Compactness: min = 0.5608, max = 0.9049
Number of rows where Convex_Area < Area (possible inconsistency): 0
Number of rows with invalid Aspect_Ration values (<=0 or >10): 0
Missing values per column:
No missing values detected.
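The scale check confirms features spanning wildly different ranges (Area up to ~136k vs Solidity below 1), so distance- or gradient-based models would benefit from standardization. A minimal column-wise z-score sketch with numpy (the generated columns are illustrative stand-ins for the real features):

```python
import numpy as np

def standardize(X):
    # Column-wise z-score: (x - column mean) / column std
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

rng = np.random.default_rng(3)
# Two columns on very different scales, mimicking Area vs Solidity
area_like = rng.normal(80000, 13000, 2500)
solidity_like = rng.normal(0.99, 0.003, 2500)
Z = standardize(np.column_stack([area_like, solidity_like]))
print("column means after scaling:", Z.mean(axis=0).round(6))
print("column stds after scaling: ", Z.std(axis=0).round(6))
```

In a real pipeline the scaler should be fit on the training split only and then applied to the test split, to avoid leaking test-set statistics.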